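The overall workflow runs in two directions: structured data held in data frames is turned into a variety of documents using pandoc-family tools, Markdown, LaTeX, and similar languages, and those documents are then converted back into structured data with a range of tools and methods.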
The example data is the publicly available English PDF of Awesome CV, a LaTeX template for your outstanding job application.
First page
library(pdftools)
library(magick)
resume_first_png <- pdf_render_page("data/resume.pdf", page = 1, dpi = 300, numeric = FALSE)
image_read(resume_first_png)
Second page
resume_second_png <- pdf_render_page("data/resume.pdf", page = 2, dpi = 300, numeric = FALSE)
image_read(resume_second_png)
Extract data frames from the semi-structured resume PDF file.
library(pdftools)
library(tidyverse)
cv_dat <- pdf_text("data/resume.pdf")
# Join all pages into one string, then split it into lines (LF or CRLF line endings)
cv_dat <- paste0(unlist(cv_dat), collapse = "")
cv_split_dat <- cv_dat %>%
  str_split(pattern = "\r?\n") %>%
  .[[1]]
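The line-splitting step can be tried without the PDF. The following is a minimal base R sketch, assuming `pages` is a hypothetical stand-in for the character vector that `pdf_text()` returns (one string per page); `split_lines` is an illustrative helper name, not part of pdftools. Splitting each page separately on `\r?\n` handles both Unix and Windows line endings and avoids gluing the last line of one page onto the first line of the next.

```r
# Hypothetical stand-in for pdf_text() output: one string per page.
pages <- c("Name\r\nSummary\r\nline a\r\n",
           "Work Experience\nline b\n")

# Illustrative helper (an assumption, not a pdftools function).
split_lines <- function(pages) {
  # strsplit() works page by page; unlist() flattens the result
  # into a single character vector of lines.
  unlist(strsplit(pages, "\r?\n"))
}

lines <- split_lines(pages)
print(lines)
```

Whether the real `pdf_text()` output uses `\n` or `\r\n` depends on the platform, so the optional `\r` is the safer pattern.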
# Personal information: every line before the "Summary" heading -------------------
`인적사항_idx` <- cv_split_dat %>%
  str_detect("Summary") %>%
  which()
인적사항 <- cv_split_dat[1:(`인적사항_idx`-1)]
# Summary ("Summary"): lines up to the "Work Experience" heading -------------------
`요약_idx` <- cv_split_dat %>%
  str_detect("Work Experience") %>%
  which()
요약 <- cv_split_dat[(`인적사항_idx`+1):(`요약_idx`-1)]
# Work experience ("Work Experience"): lines up to the "Honors & Awards" heading -------------------
`직장경력_idx` <- cv_split_dat %>%
  str_detect("Honors & Awards") %>%
  which()
직장경력 <- cv_split_dat[(`요약_idx`+1):(`직장경력_idx`-1)]
# Honors and awards ("Honors & Awards"): lines up to the "Presentation" heading -------------------
`수상이력_idx` <- cv_split_dat %>%
  str_detect("Presentation") %>%
  which()
수상이력 <- cv_split_dat[(`직장경력_idx`+1):(`수상이력_idx`-1)]
# Presentations ("Presentation"): lines up to the "Writing" heading -------------------
`발표_idx` <- cv_split_dat %>%
  str_detect("Writing") %>%
  which()
발표 <- cv_split_dat[(`수상이력_idx`+1):(`발표_idx`-1)]
# Writing ("Writing"): lines up to the "Program Committees" heading -------------------
`저서_idx` <- cv_split_dat %>%
  str_detect("Program Committees") %>%
  which()
저서 <- cv_split_dat[(`발표_idx`+1):(`저서_idx`-1)]
# Program committees ("Program Committees"): lines up to the "Education" heading -------------------
`심사_idx` <- cv_split_dat %>%
  str_detect("Education") %>%
  which()
심사 <- cv_split_dat[(`저서_idx`+1):(`심사_idx`-1)]
# Education ("Education"): lines up to the "Extracurricular" heading -------------------
`학교_idx` <- cv_split_dat %>%
  str_detect("Extracurricular") %>%
  which()
학교 <- cv_split_dat[(`심사_idx`+1):(`학교_idx`-1)]
# Extracurricular activities ("Extracurricular"): everything after the heading -------------------
특활활동 <- cv_split_dat[(`학교_idx`+1):length(cv_split_dat)]
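The section boundaries above are found one heading at a time. The same bookkeeping can be sketched as a single helper. This is only an illustration in base R under stated assumptions: `split_sections` and the sample `lines` are hypothetical, each heading is assumed to appear in document order, and only the first line containing a heading is used (a body line that happens to contain the same word would confuse it, just as it would the step-by-step version).

```r
# Illustrative helper (hypothetical name): split lines into named sections.
split_sections <- function(lines, headings) {
  # First line index that contains each heading (fixed-string match).
  starts <- vapply(headings, function(h) {
    hit <- grep(h, lines, fixed = TRUE)
    if (length(hit) == 0) NA_integer_ else hit[1]
  }, integer(1))
  if (anyNA(starts)) {
    stop("heading not found: ", paste(headings[is.na(starts)], collapse = ", "))
  }
  if (is.unsorted(starts, strictly = TRUE)) {
    stop("headings are not in document order")
  }
  # Each section ends just before the next heading; the last one runs to the end.
  ends <- c(starts[-1] - 1L, length(lines))
  body <- Map(function(s, e) if (s < e) lines[(s + 1L):e] else character(0),
              starts, ends)
  # Lines before the first heading (name, job title, contact details).
  header <- if (starts[1] > 1L) lines[seq_len(starts[1] - 1L)] else character(0)
  c(list(header = header), setNames(body, headings))
}

# Hypothetical sample lines standing in for cv_split_dat.
lines <- c("Jane Doe", "Engineer",
           "Summary", "Builds things.",
           "Work Experience", "ACME Seoul",
           "Education", "B.S. in CS")
headings <- c("Summary", "Work Experience", "Education")
sections <- split_sections(lines, headings)
str(sections)
```

The helper stops with an error when a heading is missing or out of order, which is easier to debug than a silently wrong index range.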
## Resume sections
cv_section_list <- list("인적사항" = 인적사항,
                        "요약" = 요약,
                        "직장경력" = 직장경력,
                        "수상이력" = 수상이력,
                        "발표" = 발표,
                        "저서" = 저서,
                        "심사" = 심사,
                        "학교" = 학교,
                        "특활활동" = 특활활동)
listviewer::jsonedit(cv_section_list)
# Trim whitespace and strip the Font Awesome icon glyphs (private-use code points)
인적사항 <- str_trim(인적사항) %>% str_remove_all(pattern = "\uf10b|\uf0e0|\uf015|\uf092|\uf08c")
이름 <- 인적사항[1]
직무 <- 인적사항[2]
주소 <- 인적사항[3]
개인정보 <- str_split(인적사항[4], " \\| ") %>% .[[1]]
전화번호 <- str_trim(개인정보[1])
전자우편 <- str_trim(개인정보[2])
홈페이지 <- str_trim(개인정보[3])
GitHub <- str_trim(개인정보[4])
링크트인 <- str_trim(개인정보[5])
인적사항_df <- tibble(
  "이름" = 이름,
  "직무" = 직무,
  "주소" = 주소,
  "전화번호" = 전화번호,
  "전자우편" = 전자우편,
  "홈페이지" = 홈페이지,
  "Github" = GitHub,
  "링크트인" = 링크트인
)
요약_df <- tibble(
  "요약" = str_c(요약, collapse = " ")
)
직장경력
[1] "Omnious. Co., Ltd. Seoul, S.Korea"
[2] "SOFTWARE ARCHITECT Jun. 2017 - May. 2018"
[3] "• Provisioned an easily managable hybrid infrastructure(Amazon AWS + On-premise) utilizing IaC(Infrastructure as Code) tools like Ansible, Packer"
[4] " and Terraform."
[5] "• Built fully automated CI/CD pipelines on CircleCI for containerized applications using Docker, AWS ECR and Rancher."
[6] "• Designed an overall service architecture and pipelines of the Machine Learning based Fashion Tagging API SaaS product with the micro-services"
[7] " architecture."
[8] "• Implemented several API microservices in Node.js Koa and in the serverless AWS Lambda functions."
[9] "• Deployed a centralized logging environment(ELK, Filebeat, CloudWatch, S3) which gather log data from docker containers and AWS resources."
[10] "• Deployed a centralized monitoring environment(Grafana, InfluxDB, CollectD) which gather system metrics as well as docker run-time metrics."
[11] "PLAT Corp. Seoul, S.Korea"
[12] "CO-FOUNDER & SOFTWARE ENGINEER Jan. 2016 - Jun. 2017"
[13] "• Implemented RESTful API server for car rental booking application(CARPLAT in Google Play)."
[14] "• Built and deployed overall service infrastructure utilizing Docker container, CircleCI, and several AWS stack(Including EC2, ECS, Route 53, S3,"
[15] " CloudFront, RDS, ElastiCache, IAM), focusing on high-availability, fault tolerance, and auto-scaling."
[16] "• Developed an easy-to-use Payment module which connects to major PG(Payment Gateway) companies in Korea."
[17] "R.O.K Cyber Command, MND Seoul, S.Korea"
[18] "SOFTWARE ENGINEER & SECURITY RESEARCHER (COMPULSORY MILITARY SERVICE) Aug. 2014 - Apr. 2016"
[19] "• Lead engineer on agent-less backtracking system that can discover client device’s fingerprint(including public and private IP) independently of"
[20] " the Proxy, VPN and NAT."
[21] "• Implemented a distributed web stress test tool with high anonymity."
[22] "• Implemented a military cooperation system which is web based real time messenger in Scala on Lift."
[23] "NEXON Seoul, S.Korea & LA, U.S.A"
[24] "GAME DEVELOPER INTERN AT GLOBAL INTERNSHIP PROGRAM Jan. 2013 - Feb. 2013"
[25] "• Developed in Cocos2d-x an action puzzle game(Dragon Buster) targeting U.S. market."
[26] "• Implemented API server which is communicating with game client and In-App Store, along with two other team members who wrote the game"
[27] " logic and designed game graphics."
[28] "• Won the 2nd prize in final evaluation."
[29] "ShitOne Corp. Seoul, S.Korea"
[30] "SOFTWARE ENGINEER Dec. 2011 - Feb. 2012"
[31] "• Developed a proxy drive smartphone application which connects proxy driver and customer."
[32] "• Implemented overall Android application logic and wrote API server for community service, along with lead engineer who designed bidding"
[33] " protocol on raw socket and implemented API server for bidding."
[34] "SAMSUNG Electronics S.Korea"
[35] "FREELANCE PENETRATION TESTER Sep. 2013, Mar. 2011 - Oct. 2011"
[36] "• Conducted penetration testing on SAMSUNG KNOX, which is solution for enterprise mobile security."
[37] "• Conducted penetration testing on SAMSUNG Smart TV."
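The printed vector has a regular shape: an employer line, a title line ending in a date range, then bullet lines whose wrapped text continues on lines that start with a space. One way to turn that shape into a data frame is sketched below in base R. It is a sketch under assumptions, not the method used in this document: `parse_career` and the shortened sample `career` are hypothetical, bullets are assumed to start with U+2022, and a title line is assumed to be any non-bullet line containing a month abbreviation followed by a four-digit year. The employer and location stay in one field because the single space between them gives no reliable split point.

```r
# Hypothetical, shortened sample mimicking the printed vector.
career <- c(
  "Omnious. Co., Ltd. Seoul, S.Korea",
  "SOFTWARE ARCHITECT Jun. 2017 - May. 2018",
  "\u2022 Provisioned a hybrid infrastructure utilizing IaC tools like Ansible, Packer",
  " and Terraform.",
  "\u2022 Built fully automated CI/CD pipelines on CircleCI.",
  "PLAT Corp. Seoul, S.Korea",
  "CO-FOUNDER & SOFTWARE ENGINEER Jan. 2016 - Jun. 2017",
  "\u2022 Implemented RESTful API server for car rental booking application."
)

# Illustrative parser (hypothetical name): one row per job.
parse_career <- function(lines) {
  bullet    <- "\u2022"
  is_bullet <- startsWith(lines, bullet)
  is_cont   <- grepl("^\\s", lines)            # wrapped continuation of a bullet
  period_re <- "[A-Z][a-z]{2}\\. [0-9]{4}.*$"  # e.g. "Jun. 2017 - May. 2018"
  is_title  <- !is_bullet & !is_cont & grepl(period_re, lines)
  title_idx <- which(is_title)
  job_id    <- cumsum(is_title)                # job each bullet line belongs to

  jobs <- lapply(seq_along(title_idx), function(k) {
    t <- title_idx[k]
    m <- regexpr(period_re, lines[t])
    in_job <- job_id == k & (is_bullet | is_cont)
    text <- trimws(sub(bullet, "", lines[in_job], fixed = TRUE))
    # Glue continuation lines onto the bullet they belong to.
    items <- tapply(text, cumsum(is_bullet[in_job]), paste, collapse = " ")
    data.frame(
      employer = trimws(lines[t - 1L]),        # the line above the title line
      title    = trimws(substr(lines[t], 1, as.integer(m) - 1L)),
      period   = regmatches(lines[t], m),
      tasks    = paste(items, collapse = " | "),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, jobs)
}

jobs <- parse_career(career)
print(jobs[, c("employer", "title", "period")])
```

Tasks are joined with `" | "` only to keep the result a flat data frame; a list column would preserve the individual bullets if they are needed separately.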